63 research outputs found

    Phrasal: A Toolkit for New Directions in Statistical Machine Translation

    We present a new version of Phrasal, an open-source toolkit for statistical phrase-based machine translation. This revision includes features that support emerging research trends such as (a) tuning with large feature sets, (b) tuning on large datasets like the bitext, and (c) web-based interactive machine translation. A direct comparison with Moses shows favorable results in terms of decoding speed and tuning time.

    Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

    Dense retrieval models have predominantly been studied for English, where they have shown great success thanks to the availability of human-labeled training pairs. Multilingual retrieval has seen limited success so far, as training data is uneven or scarce across languages. Synthetic training data generation is promising (e.g., InPars or Promptagator), but has been investigated only for English. Therefore, to study model capabilities across both cross-lingual and monolingual retrieval tasks, we develop SWIM-IR, a synthetic retrieval training dataset spanning 33 languages (high- to very-low-resource) for training multilingual dense retrieval models without any human supervision. To construct SWIM-IR, we propose SAP (summarize-then-ask prompting), in which the large language model (LLM) generates a textual summary prior to the query generation step. SAP assists the LLM in generating informative queries in the target language. Using SWIM-IR, we explore synthetic fine-tuning of multilingual dense retrieval models and evaluate them robustly on three retrieval benchmarks: XOR-Retrieve (cross-lingual), XTREME-UP (cross-lingual), and MIRACL (monolingual). Our models, called SWIM-X, are competitive with human-supervised dense retrieval models such as mContriever, showing that SWIM-IR can cheaply substitute for expensive human-labeled retrieval training data. Comment: Data released at https://github.com/google-research-datasets/swim-i
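    The two-stage SAP idea described in the abstract can be sketched as a minimal pipeline: the LLM first summarizes the passage, then generates a query in the target language conditioned on that summary. The `llm` callable and the prompt wording below are illustrative assumptions, not the paper's exact templates.

    ```python
    def sap_generate_query(passage: str, target_lang: str, llm) -> dict:
        """Summarize-then-ask prompting (SAP): two LLM calls per passage."""
        # Stage 1: ask the LLM for a short summary of the passage.
        summary = llm(
            "Summarize the following passage in one or two sentences:\n"
            + passage
        )
        # Stage 2: generate a query in the target language, conditioned on
        # both the passage and the intermediate summary.
        query = llm(
            f"Passage: {passage}\nSummary: {summary}\n"
            f"Write a question in {target_lang} that this passage answers:"
        )
        # A (query, passage) pair is one synthetic training example.
        return {"passage": passage, "summary": summary, "query": query}


    if __name__ == "__main__":
        # Stub "LLM" that echoes the last line of its prompt, for demonstration.
        stub = lambda prompt: prompt.splitlines()[-1]
        pair = sap_generate_query(
            "The Nile is the longest river in Africa.", "Swahili", stub
        )
        print(sorted(pair.keys()))  # ['passage', 'query', 'summary']
    ```

    In practice the stub would be replaced by a call to a real text-generation API, and the resulting (query, passage) pairs would be used as positives for contrastive fine-tuning of the dense retriever.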